Commit a1e56ef6 authored by Ophélie Gagnard's avatar Ophélie Gagnard

Summary: Make a draft version work with a Cython+ program instead of a Rust program.

Add project scan-filesystem (from Xavier Thompson's gitlab) which implements a filesystem scanner.
Modify Makefile and call to a script allow to compile it without linking it from Python functions.
Modify dracut.module/Makefile to compile scan-filesystem instead of Leo's Rust code.
parent c2bf2108
...@@ -5,7 +5,8 @@ include collect-sh-template.mk ...@@ -5,7 +5,8 @@ include collect-sh-template.mk
echo "$${collect_sh}" >> 90metadata-collect/collect.sh echo "$${collect_sh}" >> 90metadata-collect/collect.sh
90metadata-collect/metadata-collect-agent: 90metadata-collect/metadata-collect-agent:
cd ../ && ./rust-build-static.bash # cd ../ && ./rust-build-static.bash
cd ../scan-filesyttem/cython/ && make nopython && cd ../../dracut.module
90metadata-collect/fluent-bit: 90metadata-collect/fluent-bit:
cd ../ && ./fluent-bit-install.sh && cd dracut.module cd ../ && ./fluent-bit-install.sh && cd dracut.module
......
*.cpp
*.c
*.h
*.html
*.o
*.exe
*.lock
*.json
*.out
rust/target/
# A Comparison program between Cython+, Python and Rust
As a way to write a Cython+ program under real conditions, we ported an existing small program already written in Python and Rust to Cython+.
The program was chosen because it is short and because it is very conducive to parallel execution, allowing us to try out Cython+'s approach to concurrent programming.
## What it does
The program in question scans a filesystem from the root down (without following symbolic links) and:
- stores the results of a `stat` system call for each directory, file and symbolic link,
- computes messages digests from each file using `md5`, `sha1`, `sh256` and `sha512` algorithms,
- stores the target path of each symbolic link,
A JSON representation of the gathered information is then dumped in a file.
As Cython+ lacks an argument parser, the root directory is hardcoded as `/` and the output file as `result.json`.
The Python and Rust programs originally were making use of an argument parser, gathering additional information, and then uploading the result online, but they were simplified to match the description above because Cython+ currently lacks a good standard library, substantially increasing the required development work for even seemingly basic tasks.
## Building the Cython+ version
### Dependencies:
The program works on linux with Python3.7 and depends on the following:
- openssl library
- [fmt library](https://fmt.dev/latest/index.html)
- python3.7 development headers, which depending on the distribution may be packaged separately from python itself
- [Nexedi's cython+ compiler](https://lab.nexedi.com/nexedi/cython)
The paths to the openssl and fmt libraries should be updated in `cython/Makefile` and `cython/setup.py`.
The path to the python development headers shoudl be updated in `cython/Makefile`.
The path to the cython+ compiler should be added to environment variable `PYTHONPATH`.
### Building
First go to the cython subdirectory:
```
$ cd cython/
```
To build:
```
$ make
```
Or equivalently: `python3 setup.py build_ext --inplace`
To run:
```
$ make run
```
### Building without the Python runtime
The Cython compiler systematically links compiled programs against the python runtime, however this program never actually calls into the Python runtime.
There is no support yet in the Cython+ language to express independance from the Python runtime and generate code which does not need to include the Python headers. However in the meantime we introduce a hack: we only use the cython compiler to generate a C++ file which we then compile and link manually, but instead of providing the python runtime to link against we tell the linker to ignore linking errors. We also provide a `main` function as an entry point to bypass the entry point for the python interpreter.
This hack can only work because this program happens not to require the python runtime for anything.
No guarantees whatsoever are made that the program produced will behave as expected.
It has been tested and shown to work with g++.
To build:
```
$ make nopython
```
To run:
```
$ make runnopython
```
ifeq (,)
INCLUDE_DIRS = -I/usr/include/python3.7
else
INCLUDE_PYTHON = -I/srv/slapgrid/slappart6/srv/runner/shared/python3/2e435adf7e2cb7d97668da52532ac7f3/include/python3.7m
OPENSSL_PATH = /srv/slapgrid/slappart6/srv/runner/shared/openssl/24bd61db512fe6e4e0d214ae77943d75
INCLUDE_OPENSSL = -I$(OPENSSL_PATH)/include
LIBRARY_OPENSSL = -L$(OPENSSL_PATH)/lib
RUNPATH_OPENSSL = -Wl,-rpath=$(OPENSSL_PATH)/lib
FMTLIB_PATH = /srv/slapgrid/slappart6/srv/runner/shared/fmtlib/d524cc3d1a798a140778558556ec6d0c
INCLUDE_FMTLIB = -I$(FMTLIB_PATH)/include
LIBRARY_FMTLIB = -L$(FMTLIB_PATH)/lib
RUNPATH_FMTLIB = -Wl,-rpath=$(FMTLIB_PATH)/lib
INCLUDE_DIRS = $(INCLUDE_PYTHON) $(INCLUDE_OPENSSL) $(INCLUDE_FMTLIB)
LIBPATHS = $(LIBRARY_OPENSSL) $(LIBRARY_FMTLIB)
RUNPATHS = $(RUNPATH_OPENSSL) $(RUNPATH_FMTLIB)
LDFLAGS = $(LIBPATHS) $(RUNPATHS)
endif
#EXE = main
#CXX = g++
#CPPFLAGS = -O2 -g -Wno-unused-result -Wsign-compare -pthread $(INCLUDE_DIRS)
#LDFLAGS += -Wl,--unresolved-symbols=ignore-all
LDLIBS = -lcrypto -lfmt
EXT_SUFFIX := $(shell python3 -c "import sysconfig; print(sysconfig.get_config_var('EXT_SUFFIX'))")
EXT = $(EXE)$(EXT_SUFFIX)
# Build with Python runtime
all: $(EXT)
$(EXT): setup.py
@echo "[Cython Compiling $^ -> $@]"
python3 setup.py build_ext --inplace
# Run with Python runtime
run: $(EXT)
python3 -c "import $(EXE); $(EXE).python_main()" 2>/dev/null
# Build without Python runtime
nopython: main.cpp#$(EXE)
mkdir -p logs
-g++ -O2 -g -Wno-unused-result -Wsign-compare -pthread -I/usr/include/python3.7 main.cpp -lcrypto -lfmt -o main 2> logs/link_errors
./parse_link_errors.py < logs/link_errors
make fake_python.o
g++ -O2 -g -Wno-unused-result -Wsign-compare -pthread -I/usr/include/python3.7 main.cpp fake_python.o -lcrypto -lfmt -o main
%.cpp: %.pyx
@echo "[Cython Compiling $^ -> $@]"
python3 -c "from Cython.Compiler.Main import main; main(command_line=1)" $^ --cplus -3
@rm -f $(subst .cpp,.h,$@)
%: %.cpp
@echo "[C++ Compiling $^ -> $@]"
$(LINK.cpp) $^ $(LOADLIBES) $(LDLIBS) -o $@
# Run without Python runtime
runnopython: $(EXE)
./$(EXE) 2>/dev/null
clean:
-rm -f *.c *.cpp *.html
-rm -f *.h
-rm -f *.so
-rm -f $(EXE)
-rm -f *.o
-rm -f -r build
-rm -f *.json
.PHONY: all run nopython runnopython clean
.PRECIOUS: %.cpp
This source diff could not be displayed because it is too large. You can view the blob instead.
# distutils: language = c++
from libcythonplus.list cimport cyplist
from libc.stdio cimport fprintf, fopen, fclose, fread, fwrite, FILE, stdout, printf, ferror
from runtime.runtime cimport SequentialMailBox, BatchMailBox, NullResult, Scheduler
from stdlib.stat cimport Stat, dev_t
from stdlib.digest cimport MessageDigest, md5sum, sha1sum, sha256sum, sha512sum
from stdlib.fmt cimport sprintf
from stdlib.string cimport string
from stdlib.dirent cimport DIR, struct_dirent, opendir, readdir, closedir
from posix.unistd cimport readlink
cdef lock Scheduler scheduler
cdef cypclass Node activable:
string path
string name
Stat st
string formatted
__init__(self, string path, string name, Stat st):
self._active_result_class = NullResult
self._active_queue_class = consume BatchMailBox(scheduler)
self.path = path
self.name = name
self.st = st
void build_node(self, lock cyplist[dev_t] dev_whitelist, lock cyplist[string] ignore_paths):
# abstract
pass
void format_node(self):
self.formatted = sprintf("""\
{
"%s": {
"stat": %s
}
},
""",
self.path,
self.st.to_json(),
)
void write_node(self, FILE * stream):
# abstract
pass
cdef iso Node make_node(string path, string name) nogil:
s = Stat(path)
if s is NULL:
return NULL
elif s.is_symlink():
return consume SymlinkNode(path, name, consume s)
elif s.is_dir():
return consume DirNode(path, name, consume s)
elif s.is_regular():
return consume FileNode(path, name, consume s)
return NULL
cdef cypclass DirNode(Node):
cyplist[active Node] children
__init__(self, string path, string name, Stat st):
Node.__init__(self, path, name, st)
self.children = new cyplist[active Node]()
self.children.__init__()
void build_node(self, lock cyplist[dev_t] dev_whitelist, lock cyplist[string] ignore_paths):
cdef DIR *d
cdef struct_dirent *entry
cdef string entry_name
cdef string entry_path
if ignore_paths is not NULL:
if self.path in ignore_paths:
return
if dev_whitelist is not NULL:
if self.st is NULL:
return
elif not self.st.st_data.st_dev in dev_whitelist:
return
d = opendir(self.path.c_str())
if d is not NULL:
while 1:
entry = readdir(d)
if entry is NULL:
break
entry_name = entry.d_name
if entry_name == b'.' or entry_name == b'..':
continue
entry_path = self.path
if entry_path != b'/':
entry_path += b'/'
entry_path += entry_name
entry_node = make_node(entry_path, entry_name)
if entry_node is NULL:
continue
active_entry = activate(consume entry_node)
self.children.append(active_entry)
closedir(d)
self.format_node()
for active_child in self.children:
active_child.build_node(NULL, dev_whitelist, ignore_paths)
void write_node(self, FILE * stream):
fwrite(self.formatted.data(), 1, self.formatted.size(), stream)
while self.children.__len__() > 0:
active_child = self.children[self.children.__len__() -1]
del self.children[self.children.__len__() -1]
child = consume active_child
child.write_node(stream)
cdef enum:
BUFSIZE = 64 * 1024
cdef cypclass FileNode(Node):
string md5_data
string sha1_data
string sha256_data
string sha512_data
bint error
__init__(self, string path, string name, Stat st):
Node.__init__(self, path, name, st)
self.error = False
void build_node(self, lock cyplist[dev_t] dev_whitelist, lock cyplist[string] ignore_paths):
cdef unsigned char buffer[BUFSIZE]
cdef bint eof = False
cdef bint md5_ok
cdef bint sha1_ok
cdef bint sha256_ok
cdef bint sha512_ok
cdef FILE * file = fopen(self.path.c_str(), 'rb')
if file is NULL:
self.error = True
self.format_node()
return
md5 = MessageDigest(md5sum())
sha1 = MessageDigest(sha1sum())
sha256 = MessageDigest(sha256sum())
sha512 = MessageDigest(sha512sum())
md5_ok = md5 is not NULL
sha1_ok = sha1 is not NULL
sha256_ok = sha256 is not NULL
sha512_ok = sha512 is not NULL
while not eof and (md5_ok or sha1_ok or sha256_ok or sha512_ok):
size = fread(buffer, 1, BUFSIZE, file)
if size != BUFSIZE:
self.error = ferror(file)
if self.error:
break
eof = True
if md5_ok: md5_ok = md5.update(buffer, size) == 0
if sha1_ok: sha1_ok = sha1.update(buffer, size) == 0
if sha256_ok: sha256_ok = sha256.update(buffer, size) == 0
if sha512_ok: sha512_ok = sha512.update(buffer, size) == 0
fclose(file)
if not self.error:
if md5_ok: self.md5_data = md5.hexdigest()
if sha1_ok: self.sha1_data = sha1.hexdigest()
if sha256_ok: self.sha256_data = sha256.hexdigest()
if sha512_ok: self.sha512_data = sha512.hexdigest()
self.format_node()
void format_node(self):
if self.error:
Node.format_node(self)
else:
self.formatted = sprintf("""\
{
"%s": {
"stat": %s,
"digests": {
"md5": "%s",
"sha1": "%s",
"sha256": "%s",
"sha512": "%s"
}
}
},
""",
self.path,
self.st.to_json(),
self.md5_data,
self.sha1_data,
self.sha256_data,
self.sha512_data,
)
void write_node(self, FILE * stream):
fwrite(self.formatted.data(), 1, self.formatted.size(), stream)
cdef cypclass SymlinkNode(Node):
string target
int error
void build_node(self, lock cyplist[dev_t] dev_whitelist, lock cyplist[string] ignore_paths):
size = self.st.st_data.st_size + 1
self.target.resize(size)
real_size = readlink(self.path.c_str(), <char*> self.target.data(), size)
self.error = not (0 < real_size < size)
self.target.resize(real_size)
self.format_node()
void format_node(self):
if self.error:
Node.format_node(self)
else:
self.formatted = sprintf("""\
{
"%s": {
"stat": %s,
"target": "%s"
}
},
""",
self.path,
self.st.to_json(),
self.target,
)
void write_node(self, FILE * stream):
fwrite(self.formatted.data(), 1, self.formatted.size(), stream)
cdef int start(string path) nogil:
global scheduler
scheduler = Scheduler()
ignore_paths = cyplist[string]()
ignore_paths.append(b'/opt/slapgrid')
ignore_paths.append(b'/srv/slapgrid')
dev_whitelist_paths = cyplist[string]()
dev_whitelist_paths.append(b'.')
dev_whitelist_paths.append(b'/')
dev_whitelist_paths.append(b'/boot')
dev_whitelist = cyplist[dev_t]()
for p in dev_whitelist_paths:
p_stat = Stat(p)
if p_stat is not NULL:
p_dev = p_stat.st_data.st_dev
dev_whitelist.append(p_dev)
node = make_node(path, path)
if node is NULL:
return -1
active_node = activate(consume node)
active_node.build_node(NULL, consume dev_whitelist, consume ignore_paths)
scheduler.finish()
node = consume active_node
result = fopen('result.json', 'w')
if result is NULL:
return -1
fprintf(result, '[\n')
node.write_node(result)
fprintf(result, ' {}\n]\n')
fclose(result)
del scheduler
return 0
cdef public int main() nogil:
return start(<char*>'.')
def python_main():
start(<char*>'.')
#!/usr/bin/env python3
import sys
import re
def parse_input():
p = re.compile(r'undefined reference to `(.*(Py).*)\'')
functions_to_implement = set()
while True:
try:
m = p.search(input())
except EOFError:
break
if m != None:
functions_to_implement.update({m.group(1)})
return functions_to_implement
def produce_code(parsed_input, file_name='fake_python.c'):
f = open(file_name, "w")
for function in parsed_input:
f.write('void ' + function + '(){}\n')
f.close()
if __name__ == '__main__':
parsed_input = parse_input()
produce_code(parsed_input)
cdef extern from "<sys/types.h>" nogil:
ctypedef long unsigned int pthread_t
ctypedef union pthread_attr_t:
pass
ctypedef union pthread_mutex_t:
pass
ctypedef union pthread_mutexattr_t:
pass
ctypedef union pthread_barrier_t:
pass
ctypedef union pthread_barrierattr_t:
pass
ctypedef union pthread_cond_t:
pass
ctypedef union pthread_condattr_t:
pass
cdef extern from "<pthread.h>" nogil:
int pthread_create(pthread_t *, const pthread_attr_t *, void *(*)(void *), void *)
void pthread_exit(void *)
int pthread_join(pthread_t, void **)
int pthread_cancel(pthread_t thread)
int pthread_attr_init(pthread_attr_t *)
int pthread_attr_setdetachstate(pthread_attr_t *, int)
int pthread_attr_destroy(pthread_attr_t *)
int pthread_mutex_init(pthread_mutex_t *, const pthread_mutexattr_t *)
int pthread_mutex_destroy(pthread_mutex_t *)
int pthread_mutex_lock(pthread_mutex_t *)
int pthread_mutex_unlock(pthread_mutex_t *)
int pthread_mutex_trylock(pthread_mutex_t *)
int pthread_barrier_init(pthread_barrier_t *, const pthread_barrierattr_t *, unsigned int)
int pthread_barrier_destroy(pthread_barrier_t *)
int pthread_barrier_wait(pthread_barrier_t *)
int pthread_cond_init(pthread_cond_t * cond, const pthread_condattr_t * attr)
int pthread_cond_destroy(pthread_cond_t *cond)
int pthread_cond_wait(pthread_cond_t * cond, pthread_mutex_t * mutex)
int pthread_cond_broadcast(pthread_cond_t *cond)
int pthread_cond_signal(pthread_cond_t *cond)
enum: PTHREAD_CREATE_JOINABLE
\ No newline at end of file
# distutils: language = c++
from libcpp.deque cimport deque
from libcpp.vector cimport vector
from libcpp.atomic cimport atomic
from libc.stdio cimport printf
from libc.stdlib cimport rand
from posix.unistd cimport sysconf
from runtime.pthreads cimport *
from runtime.semaphore cimport *
cdef extern from "<unistd.h>" nogil:
enum: _SC_NPROCESSORS_ONLN # Seems to not be included in "posix.unistd".
cdef cypclass Scheduler
cdef cypclass Worker
# The 'inline' qualifier on this function is a hack to convince Cython to allow a definition in a .pxd file.
# The C compiler will dismiss it because we pass the function pointer to create a thread which prevents inlining.
cdef inline void * worker_function(void * arg) nogil:
worker = <lock Worker> arg
sch = <Scheduler> <void*> worker.scheduler
cdef int num_remaining_queues
# Wait until all the workers are ready.
pthread_barrier_wait(&sch.barrier)
while 1:
# Wait until a queue becomes available.
sem_wait(&sch.num_free_queues)
# If the scheduler is done there is nothing to do anymore.
if sch.is_done:
return <void*> 0
# Pop or steal a queue.
queue = worker.get_queue()
with wlocked queue:
# Do one task on the queue.
queue.activate()
if queue.is_empty():
# Mark the empty queue as not assigned to any worker.
queue.has_worker = False
# Decrement the number of non-completed queues.
if sch.num_pending_queues.fetch_sub(1) == 1:
# Signal that there are no more queues.
sem_post(&sch.done)
# Discard the empty queue and continue the main loop.
continue
# The queue is not empty: reinsert it in this worker's queues.
worker.queues.push_back(queue)
# Signal that the queue is available.
sem_post(&sch.num_free_queues)
cdef cypclass Worker:
deque[lock SequentialMailBox] queues
lock Scheduler scheduler
pthread_t thread
lock Worker __new__(alloc, lock Scheduler scheduler):
instance = consume alloc()
instance.scheduler = scheduler
locked_instance = <lock Worker> consume instance
if not pthread_create(&locked_instance.thread, NULL, worker_function, <void *> locked_instance):
return locked_instance
printf("pthread_create() failed\n")
lock SequentialMailBox get_queue(lock self):
# Get the next queue in the worker's list or steal one.
with wlocked self:
if not self.queues.empty():
queue = self.queues.front()
self.queues.pop_front()
return queue
return self.steal_queue()
lock SequentialMailBox steal_queue(lock self):
# Steal a queue from another worker:
# - inspect each worker in order starting at a random offset
# - skip this worker and any worker with an empty queue list
# - return the last queue of the first worker with a non-empty list
cdef int i, index, num_workers, random_offset
sch = <Scheduler> <void*> self.scheduler
num_workers = <int> sch.workers.size()
random_offset = rand() % num_workers
for i in range(num_workers):
index = (i + random_offset) % num_workers
victim = sch.workers[index]
if victim is self:
continue
with wlocked victim:
if not victim.queues.empty():
stolen_queue = victim.queues.back()
victim.queues.pop_back()
stolen_queue.has_worker = True
return stolen_queue
return NULL
int join(self):
# Join the worker thread.
return pthread_join(self.thread, NULL)
cdef cypclass Scheduler:
vector[lock Worker] workers
pthread_barrier_t barrier
sem_t num_free_queues
atomic[int] num_pending_queues
sem_t done
volatile bint is_done
lock Scheduler __new__(alloc, int num_workers=0):
self = <lock Scheduler> consume alloc()
if num_workers == 0: num_workers = sysconf(_SC_NPROCESSORS_ONLN)
sem_init(&self.num_free_queues, 0, 0)
sem_init(&self.done, 0, 0)
self.num_pending_queues.store(0)
if pthread_barrier_init(&self.barrier, NULL, num_workers + 1):
printf("Could not allocate memory for the thread barrier\n")
# Signal that no work will be done.
sem_post(&self.done)
return self
self.is_done = False
self.workers.reserve(num_workers)
for i in range(num_workers):
worker = Worker(self)
if worker is NULL:
# Signal that no work will be done.
sem_post(&self.done)
return self
self.workers.push_back(worker)
# Wait until all the worker threads are ready.
pthread_barrier_wait(&self.barrier)
return self
__dealloc__(self):
pthread_barrier_destroy(&self.barrier)
sem_destroy(&self.num_free_queues)
sem_destroy(&self.done)
void post_queue(self, lock SequentialMailBox queue):
# Add a queue to the first worker.
main_worker = self.workers[0]
with wlocked main_worker:
queue.has_worker = True
main_worker.queues.push_back(queue)
# Increment the number of non-completed queues.
self.num_pending_queues.fetch_add(1)
# Signal that a queue is available.
sem_post(&self.num_free_queues)
void finish(lock self):
# Wait until there is no more work.
done = &self.done
sem_wait(done)
# Signal the worker threads that there is no more work.
self.is_done = True
# Pretend that there are new queues to wake up the workers.
num_free_queues = &self.num_free_queues
for worker in self.workers:
sem_post(num_free_queues)
# Clear the workers to break reference cycles.
self.workers.clear()
cdef cypclass SequentialMailBox(ActhonQueueInterface):
deque[ActhonMessageInterface] messages
lock Scheduler scheduler
bint has_worker
__init__(self, lock Scheduler scheduler):
self.scheduler = scheduler
self.has_worker = False
bint is_empty(const self):
return self.messages.empty()
void push(locked self, ActhonMessageInterface message):
# Add a task to the queue.
self.messages.push_back(message)
if message._sync_method is not NULL:
message._sync_method.insertActivity()
# If no worker is already assigned this queue
# register it with the scheduler.
if not self.has_worker:
self.scheduler.post_queue(self)
bint activate(self):
# Try to process the first message in the queue.
cdef bint one_message_processed
if self.messages.empty():
return False
next_message = self.messages.front()
self.messages.pop_front()
one_message_processed = next_message.activate()
if one_message_processed:
if next_message._sync_method is not NULL:
next_message._sync_method.removeActivity()
else:
printf("Pushed front message to back :/\n")
self.messages.push_back(next_message)
return one_message_processed
cdef cypclass BatchMailBox(SequentialMailBox):
bint activate(self):
# Process as many messages as possible.
while not self.messages.empty():
next_message = self.messages.front()
self.messages.pop_front()
if not next_message.activate():
printf("Pushed front message to back :/\n")
self.messages.push_back(next_message)
return False
if next_message._sync_method is not NULL:
next_message._sync_method.removeActivity()
return True
cdef inline ActhonResultInterface NullResult() nogil:
return NULL
cdef extern from "<semaphore.h>" nogil:
ctypedef struct sem_t:
pass
int sem_init(sem_t *sem, int pshared, unsigned int value)
int sem_wait(sem_t *sem)
int sem_post(sem_t *sem)
int sem_getvalue(sem_t *, int *)
int sem_destroy(sem_t* sem)
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
extensions = [
Extension(
"main",
language="c++",
sources=["main.pyx"],
include_dirs = [
"/srv/slapgrid/slappart6/srv/runner/shared/openssl/24bd61db512fe6e4e0d214ae77943d75/include",
"/srv/slapgrid/slappart6/srv/runner/shared/fmtlib/d524cc3d1a798a140778558556ec6d0c/include"
],
library_dirs = [
"/srv/slapgrid/slappart6/srv/runner/shared/openssl/24bd61db512fe6e4e0d214ae77943d75/lib",
"/srv/slapgrid/slappart6/srv/runner/shared/fmtlib/d524cc3d1a798a140778558556ec6d0c/lib"
],
libraries = ["crypto", "fmt"],
extra_link_args=[
"-Wl,-rpath=/srv/slapgrid/slappart6/srv/runner/shared/openssl/24bd61db512fe6e4e0d214ae77943d75/lib",
"-Wl,-rpath=/srv/slapgrid/slappart6/srv/runner/shared/fmtlib/d524cc3d1a798a140778558556ec6d0c/lib"
],
),
]
setup(
ext_modules = cythonize(extensions)
)
# distutils: language = c++
from stdlib.string cimport string
cdef extern from "<openssl/evp.h>" nogil:
ctypedef struct EVP_MD_CTX:
pass
ctypedef struct EVP_MD
ctypedef struct ENGINE
void EVP_MD_CTX_init(EVP_MD_CTX *ctx)
EVP_MD_CTX *EVP_MD_CTX_create()
int EVP_DigestInit_ex(EVP_MD_CTX *ctx, const EVP_MD *type, ENGINE *impl)
int EVP_DigestUpdate(EVP_MD_CTX *ctx, const void *d, unsigned int cnt)
int EVP_DigestFinal_ex(EVP_MD_CTX *ctx, unsigned char *md, unsigned int *s)
int EVP_MD_CTX_cleanup(EVP_MD_CTX *ctx)
void EVP_MD_CTX_destroy(EVP_MD_CTX *ctx)
int EVP_MD_CTX_copy_ex(EVP_MD_CTX *out, const EVP_MD_CTX *_in)
int EVP_DigestInit(EVP_MD_CTX *ctx, const EVP_MD *type)
int EVP_DigestFinal(EVP_MD_CTX *ctx, unsigned char *md, unsigned int *s)
int EVP_MD_CTX_copy(EVP_MD_CTX *out,EVP_MD_CTX *_in)
# Max size
const int EVP_MAX_MD_SIZE
# Algorithms
const EVP_MD *md5sum "EVP_md5"()
const EVP_MD *sha1sum "EVP_sha1"()
const EVP_MD *sha256sum "EVP_sha256"()
const EVP_MD *sha512sum "EVP_sha512"()
const EVP_MD *EVP_get_digestbyname(const char *name)
cdef extern from * nogil:
'''
static const char hexdigits [] = "0123456789abcdef";
'''
cdef const char hexdigits[]
cdef cypclass MessageDigest:
EVP_MD_CTX * md_ctx
MessageDigest __new__(alloc, const EVP_MD * algo):
md_ctx = EVP_MD_CTX_create()
if md_ctx is NULL:
return NULL
if EVP_DigestInit_ex(md_ctx, algo, NULL) != 1:
return NULL
instance = alloc()
instance.md_ctx = md_ctx
return instance
__dealloc__(self):
EVP_MD_CTX_destroy(md_ctx)
int update(self, unsigned char * message, size_t size):
return EVP_DigestUpdate(self.md_ctx, message, size) - 1
string hexdigest(self):
cdef char result[EVP_MAX_MD_SIZE*2]
cdef unsigned char hashbuffer[EVP_MAX_MD_SIZE]
cdef unsigned int size
cdef unsigned int i
if EVP_DigestFinal_ex(self.md_ctx, hashbuffer, &size) == 1:
for i in range(size):
result[2*i] = hexdigits[hashbuffer[i] >> 4]
result[2*i+1] = hexdigits[hashbuffer[i] & 15]
return string(result, 2*size)
from posix.types cimport ino_t
cdef extern from "<sys/types.h>" nogil:
ctypedef struct DIR
cdef extern from "<dirent.h>" nogil:
cdef struct struct_dirent "dirent":
ino_t d_ino
char d_name[256]
DIR *opendir(const char *name)
struct_dirent *readdir(DIR *dirp)
int readdir_r(DIR *dirp, struct_dirent *entry, struct_dirent **result)
int closedir(DIR *dirp)
from stdlib.string cimport string
cdef extern from "<fmt/printf.h>" namespace "fmt" nogil:
ctypedef struct FILE
int printf (const char* template, ...)
int fprintf (FILE *stream, const char* template, ...)
string sprintf (const char* template, ...)
# distutils: language = c++
# Differences with posix.stat:
#
# - the declaration for the non-standard field st_birthtime was removed
# because cypclass wrapping triggers the generation of a conversion
# function for the stat structure which references this field.
#
# - the absent declaration in posix.time of struct timespec was added.
#
# - the declarations for the time_t fields st_atime, st_mtime, st_ctime
# were replaced by the fields st_atim, st_mtim, st_ctim
# of type struct timespec.
from stdlib.string cimport string
from stdlib.fmt cimport sprintf
from posix.types cimport (blkcnt_t, blksize_t, dev_t, gid_t, ino_t, mode_t,
nlink_t, off_t, time_t, uid_t)
cdef extern from "<sys/time.h>" nogil:
cdef struct struct_timespec "timespec":
time_t tv_sec
long int tv_nsec
cdef extern from "<sys/stat.h>" nogil:
cdef struct struct_stat "stat":
dev_t st_dev
ino_t st_ino
mode_t st_mode
nlink_t st_nlink
uid_t st_uid
gid_t st_gid
dev_t st_rdev
off_t st_size
blksize_t st_blksize
blkcnt_t st_blocks
struct_timespec st_atim
struct_timespec st_mtim
struct_timespec st_ctim
# POSIX prescribes including both <sys/stat.h> and <unistd.h> for these
cdef extern from "<unistd.h>" nogil:
int fchmod(int, mode_t)
int chmod(const char *, mode_t)
int fstat(int, struct_stat *)
int lstat(const char *, struct_stat *)
int stat(const char *, struct_stat *)
# Macros for st_mode
mode_t S_ISREG(mode_t)
mode_t S_ISDIR(mode_t)
mode_t S_ISCHR(mode_t)
mode_t S_ISBLK(mode_t)
mode_t S_ISFIFO(mode_t)
mode_t S_ISLNK(mode_t)
mode_t S_ISSOCK(mode_t)
mode_t S_IFMT
mode_t S_IFREG
mode_t S_IFDIR
mode_t S_IFCHR
mode_t S_IFBLK
mode_t S_IFIFO
mode_t S_IFLNK
mode_t S_IFSOCK
# Permissions
mode_t S_ISUID
mode_t S_ISGID
mode_t S_ISVTX
mode_t S_IRWXU
mode_t S_IRUSR
mode_t S_IWUSR
mode_t S_IXUSR
mode_t S_IRWXG
mode_t S_IRGRP
mode_t S_IWGRP
mode_t S_IXGRP
mode_t S_IRWXO
mode_t S_IROTH
mode_t S_IWOTH
mode_t S_IXOTH
# Cypclass to expose minimal stat support.
cdef cypclass Stat:
struct_stat st_data
Stat __new__(alloc, string path):
instance = alloc()
if not lstat(path.c_str(), &instance.st_data):
return instance
bint is_regular(self):
return S_ISREG(self.st_data.st_mode)
bint is_symlink(self):
return S_ISLNK(self.st_data.st_mode)
bint is_dir(self):
return S_ISDIR(self.st_data.st_mode)
string to_json(self):
return sprintf("""{
"st_dev": %lu,
"st_ino": %lu,
"st_mode": %lu,
"st_nlink": %lu,
"st_uid": %d,
"st_gid": %d,
"st_rdev": %lu,
"st_size": %ld,
"st_blksize": %ld,
"st_blocks": %ld,
"st_atime": %ld,
"st_mtime": %ld,
"st_ctime": %ld,
"st_atime_ns": %ld,
"st_mtime_ns": %ld,
"st_ctime_ns": %ld
}""",
self.st_data.st_dev,
self.st_data.st_ino,
self.st_data.st_mode,
self.st_data.st_nlink,
self.st_data.st_uid,
self.st_data.st_gid,
self.st_data.st_rdev,
self.st_data.st_size,
self.st_data.st_blksize,
self.st_data.st_blocks,
self.st_data.st_atim.tv_sec,
self.st_data.st_mtim.tv_sec,
self.st_data.st_ctim.tv_sec,
self.st_data.st_atim.tv_nsec,
self.st_data.st_mtim.tv_nsec,
self.st_data.st_ctim.tv_nsec,
)
# Differences with libcpp.string:
#
# - declarations for operator+= have been added.
cdef extern from "<string>" namespace "std" nogil:
size_t npos = -1
cdef cppclass string:
cppclass iterator:
iterator()
char& operator*()
iterator(iterator &)
iterator operator++()
iterator operator--()
bint operator==(iterator)
bint operator!=(iterator)
cppclass reverse_iterator:
char& operator*()
iterator operator++()
iterator operator--()
iterator operator+(size_t)
iterator operator-(size_t)
bint operator==(reverse_iterator)
bint operator!=(reverse_iterator)
bint operator<(reverse_iterator)
bint operator>(reverse_iterator)
bint operator<=(reverse_iterator)
bint operator>=(reverse_iterator)
cppclass const_iterator(iterator):
pass
cppclass const_reverse_iterator(reverse_iterator):
pass
string() except +
string(const char *) except +
string(const char *, size_t) except +
string(const string&) except +
# as a string formed by a repetition of character c, n times.
string(size_t, char) except +
# from a pair of iterators
string(iterator first, iterator last) except +
iterator begin()
const_iterator const_begin "begin"()
iterator end()
const_iterator const_end "end"()
reverse_iterator rbegin()
const_reverse_iterator const_rbegin "rbegin"()
reverse_iterator rend()
const_reverse_iterator const_rend "rend"()
const char* c_str()
const char* data()
size_t size()
size_t max_size()
size_t length()
void resize(size_t)
void resize(size_t, char c)
size_t capacity()
void reserve(size_t)
void clear()
bint empty()
iterator erase(iterator position)
iterator erase(const_iterator position)
iterator erase(iterator first, iterator last)
iterator erase(const_iterator first, const_iterator last)
char& at(size_t)
char& operator[](size_t)
char& front() # C++11
char& back() # C++11
int compare(const string&)
string& operator+=(const string&)
string& operator+=(char)
string& operator+=(const char *)
string& append(const string&)
string& append(const string&, size_t, size_t)
string& append(const char *)
string& append(const char *, size_t)
string& append(size_t, char)
void push_back(char c)
string& assign (const string&)
string& assign (const string&, size_t, size_t)
string& assign (const char *, size_t)
string& assign (const char *)
string& assign (size_t n, char c)
string& insert(size_t, const string&)
string& insert(size_t, const string&, size_t, size_t)
string& insert(size_t, const char* s, size_t)
string& insert(size_t, const char* s)
string& insert(size_t, size_t, char c)
size_t copy(char *, size_t, size_t)
size_t find(const string&)
size_t find(const string&, size_t)
size_t find(const char*, size_t pos, size_t)
size_t find(const char*, size_t pos)
size_t find(char, size_t pos)
size_t rfind(const string&, size_t)
size_t rfind(const char* s, size_t, size_t)
size_t rfind(const char*, size_t pos)
size_t rfind(char c, size_t)
size_t rfind(char c)
size_t find_first_of(const string&, size_t)
size_t find_first_of(const char* s, size_t, size_t)
size_t find_first_of(const char*, size_t pos)
size_t find_first_of(char c, size_t)
size_t find_first_of(char c)
size_t find_first_not_of(const string&, size_t)
size_t find_first_not_of(const char* s, size_t, size_t)
size_t find_first_not_of(const char*, size_t pos)
size_t find_first_not_of(char c, size_t)
size_t find_first_not_of(char c)
size_t find_last_of(const string&, size_t)
size_t find_last_of(const char* s, size_t, size_t)
size_t find_last_of(const char*, size_t pos)
size_t find_last_of(char c, size_t)
size_t find_last_of(char c)
size_t find_last_not_of(const string&, size_t)
size_t find_last_not_of(const char* s, size_t, size_t)
size_t find_last_not_of(const char*, size_t pos)
string substr(size_t, size_t)
string substr()
string substr(size_t)
size_t find_last_not_of(char c, size_t)
size_t find_last_not_of(char c)
#string& operator= (const string&)
#string& operator= (const char*)
#string& operator= (char)
string operator+ (const string& rhs)
string operator+ (const char* rhs)
bint operator==(const string&)
bint operator==(const char*)
bint operator!= (const string& rhs )
bint operator!= (const char* )
bint operator< (const string&)
bint operator< (const char*)
bint operator> (const string&)
bint operator> (const char*)
bint operator<= (const string&)
bint operator<= (const char*)
bint operator>= (const string&)
bint operator>= (const char*)
#!/usr/bin/env python3
# Developed with Python 3.8.5
import sys
import os
import stat
import traceback
import hashlib
import io
import multiprocessing
import json
def compute_hashes(entry_path):
with open(entry_path, mode="rb") as f:
md5 = hashlib.md5()
sha1 = hashlib.sha1()
sha256 = hashlib.sha256()
sha512 = hashlib.sha512()
while True:
data = f.read(io.DEFAULT_BUFFER_SIZE)
md5.update(data)
sha1.update(data)
sha256.update(data)
sha512.update(data)
if len(data) < io.DEFAULT_BUFFER_SIZE:
break
return {"md5": md5.hexdigest(),
"sha1": sha1.hexdigest(),
"sha256": sha256.hexdigest(),
"sha512": sha512.hexdigest()}
def stat_result_to_dict(stat_result):
return {
"st_mode": getattr(stat_result, "st_mode", None),
"st_ino": getattr(stat_result, "st_ino", None),
"st_dev": getattr(stat_result, "st_dev", None),
"st_nlink": getattr(stat_result, "st_nlink", None),
"st_uid": getattr(stat_result, "st_uid", None),
"st_gid": getattr(stat_result, "st_gid", None),
"st_size": getattr(stat_result, "st_size", None),
"st_atime": getattr(stat_result, "st_atime", None),
"st_mtime": getattr(stat_result, "st_mtime", None),
"st_ctime": getattr(stat_result, "st_ctime", None),
"st_blocks": getattr(stat_result, "st_blocks", None),
"st_blksize": getattr(stat_result, "st_blksize", None),
"st_rdev": getattr(stat_result, "st_rdev", None),
"st_flags": getattr(stat_result, "st_flags", None),
"st_gen": getattr(stat_result, "st_gen", None),
"st_birthtime": getattr(stat_result, "st_birthtime", None),
}
def construct_fs_tree(mp_pool=None, mp_tasks=[], cur_dict=None, path="/", dev_whitelist=None, ignored_dirs=[]):
is_first_call = False
if mp_pool == None:
is_first_call = True
mp_pool = multiprocessing.Pool()
if cur_dict == None:
cur_dict = {"stat": stat_result_to_dict(os.stat(path, follow_symlinks=False)),
"childs": dict()}
if dev_whitelist != None:
path_stat = cur_dict["stat"]
if "st_dev" in path_stat:
if not path_stat["st_dev"] in dev_whitelist:
return cur_dict
for dir in ignored_dirs:
if path.startswith(dir):
cur_dict["ignored"] = True
return cur_dict
try:
with os.scandir(path) as it:
for entry in it:
try:
entry_path = os.fsdecode(entry.path)
entry_name = os.fsdecode(entry.name)
try:
entry_stat = os.stat(entry_path, follow_symlinks=False)
except Exception:
entry_stat = None
cur_dict["childs"][entry_name] = {"stat": stat_result_to_dict(entry_stat),
"childs": dict()}
if stat.S_ISDIR(entry_stat.st_mode):
construct_fs_tree(mp_pool=mp_pool, mp_tasks=mp_tasks, cur_dict=cur_dict["childs"][entry_name],
path=entry_path, dev_whitelist=dev_whitelist, ignored_dirs=ignored_dirs)
elif stat.S_ISREG(entry_stat.st_mode):
mp_tasks.append({"result": mp_pool.apply_async(compute_hashes, [entry_path]),
"merge_into": cur_dict["childs"][entry_name]})
elif stat.S_ISLNK(entry_stat.st_mode):
cur_dict["childs"][entry_name]["symlink_target"] = os.readlink(
entry_path)
except Exception:
pass
except Exception:
pass
if is_first_call == True:
mp_pool.close()
for task in mp_tasks:
try:
result = task["result"].get()
for k in iter(result):
task["merge_into"][k] = result[k]
except Exception:
pass
mp_pool.join()
return cur_dict
def main():
ignored_dirs = ["/opt/slapgrid", "/srv/slapgrid"]
dev_whitelist = list()
for path in [".", "/", "/boot"]:
try:
dev_whitelist.append(
os.stat(path, follow_symlinks=False).st_dev)
except FileNotFoundError:
pass
tree = construct_fs_tree(path=".", dev_whitelist=dev_whitelist, ignored_dirs=ignored_dirs)
with open('result.json', 'w') as fp:
json.dump(tree, fp, indent=2, separators=(',', ': '))
if __name__ == "__main__":
main()
[package]
name = "main"
version = "0.1.0"
authors = ["Leo Le Bouter <leo.le.bouter@nexedi.com>", "Xavier Thompson <xavier.thompson@nexedi.com>"]
edition = "2018"
[dependencies]
md-5 = "0.9.1"
sha-1 = "0.9.1"
sha2 = "0.9.1"
hex = "0.4.2"
anyhow = "1.0.32"
rmp-serde = "0.14.4"
nix = "0.18.0"
serde = { version = "1.0.115", features = ["derive"] }
base64 = "0.12.3"
rayon = "1.4.0"
serde_json = "1.0.57"
[profile.release]
opt-level = 'z'
lto = true
codegen-units = 1
use anyhow::Result;
use rayon::prelude::*;
use serde::Serialize;
use std::collections::HashMap;
use std::{
fs::DirEntry,
fs::File,
io::{Read, Write},
path::PathBuf,
path::Path,
sync::{Arc, Mutex},
};
#[derive(Debug, Serialize)]
struct FileStat {
st_dev: u64,
st_ino: u64,
st_nlink: u64,
st_mode: u32,
st_uid: u32,
st_gid: u32,
st_rdev: u64,
st_size: i64,
st_blksize: i64,
st_blocks: i64,
st_atime: i64,
st_atime_nsec: i64,
st_mtime: i64,
st_mtime_nsec: i64,
st_ctime: i64,
st_ctime_nsec: i64,
}
impl From<nix::sys::stat::FileStat> for FileStat {
fn from(x: nix::sys::stat::FileStat) -> Self {
Self {
st_dev: x.st_dev,
st_ino: x.st_ino,
st_nlink: x.st_nlink,
st_mode: x.st_mode,
st_uid: x.st_uid,
st_gid: x.st_gid,
st_rdev: x.st_rdev,
st_size: x.st_size,
st_blksize: x.st_blksize,
st_blocks: x.st_blocks,
st_atime: x.st_atime,
st_atime_nsec: x.st_atime_nsec,
st_mtime: x.st_mtime,
st_mtime_nsec: x.st_mtime_nsec,
st_ctime: x.st_ctime,
st_ctime_nsec: x.st_ctime_nsec,
}
}
}
#[derive(Default, Debug, Serialize)]
struct Tree {
childs: HashMap<String, Tree>,
stat: Option<FileStat>,
ignored: bool,
symlink_target: Option<String>,
md5: Option<String>,
sha1: Option<String>,
sha256: Option<String>,
sha512: Option<String>,
}
fn construct_fs_tree(
cur_tree: Option<Tree>,
path: &PathBuf,
dev_whitelist: &Vec<u64>,
ignored_dirs: &Vec<PathBuf>,
) -> Result<Tree> {
let mut cur_tree = match cur_tree {
Some(cur_tree) => cur_tree,
None => Tree {
stat: Some(nix::sys::stat::lstat(path)?.into()),
..Tree::default()
},
};
if !dev_whitelist.iter().any(|x| match &cur_tree.stat {
Some(stat) if stat.st_dev == *x => true,
_ => false,
}) {
return Ok(cur_tree);
}
if ignored_dirs.iter().any(|x| path.starts_with(x)) {
cur_tree.ignored = true;
return Ok(cur_tree);
}
let entries: Vec<Result<DirEntry, _>> = match std::fs::read_dir(&path) {
Ok(x) => x,
_ => return Ok(cur_tree),
}
.collect();
let cur_tree = {
let cur_tree = Arc::new(Mutex::new(cur_tree));
entries.par_iter().for_each(|entry| match entry {
Ok(entry) => {
let mut tree = Tree {
stat: match nix::sys::stat::lstat(&entry.path()) {
Ok(s) => Some(s.into()),
_ => None,
},
..Tree::default()
};
match entry.file_type() {
Ok(file_type) if file_type.is_dir() => {
tree = construct_fs_tree(
Some(tree),
&entry.path(),
dev_whitelist,
ignored_dirs,
)
.unwrap();
}
Ok(file_type) if file_type.is_file() => {
if let Ok(mut file) = std::fs::File::open(&entry.path()) {
use md5::{Digest, Md5};
use sha1::Sha1;
use sha2::{Sha256, Sha512};
let mut md5 = Md5::new();
let mut sha1 = Sha1::new();
let mut sha256 = Sha256::new();
let mut sha512 = Sha512::new();
let buf: &mut [u8] = &mut [0; 64 * 1024];
loop {
match file.read(buf) {
Ok(0) => {
tree.md5 = Some(hex::encode(md5.finalize()));
tree.sha1 = Some(hex::encode(sha1.finalize()));
tree.sha256 = Some(hex::encode(sha256.finalize()));
tree.sha512 = Some(hex::encode(sha512.finalize()));
break;
}
Ok(n) => {
md5.update(&buf[0..n - 1]);
sha1.update(&buf[0..n - 1]);
sha256.update(&buf[0..n - 1]);
sha512.update(&buf[0..n - 1]);
}
Err(e) if e.kind() == std::io::ErrorKind::Interrupted => {
continue;
}
Err(e) => {
eprintln!("{:#?}", e);
break;
}
};
}
}
}
Ok(file_type) if file_type.is_symlink() => {
tree.symlink_target = match std::fs::read_link(&entry.path()) {
Ok(target) => match target.to_str() {
Some(target_str) => Some(String::from(target_str)),
None => Some(String::from("<invalid unicode error>")),
}
Err(_) => None,
}
}
_ => {}
};
{
let mut locked = cur_tree.lock().unwrap();
locked
.childs
.insert(entry.file_name().to_str().unwrap().to_string(), tree);
}
}
_ => {}
});
Arc::try_unwrap(cur_tree).unwrap().into_inner().unwrap()
};
Ok(cur_tree)
}
fn main() -> Result<()> {
let ignored_dirs = ["/opt/slapgrid", "/srv/slapgrid"]
.iter()
.map(PathBuf::from)
.collect();
let disk_partitions = [".", "/", "/boot"];
let dev_whitelist = disk_partitions
.iter()
.filter_map(|p| match nix::sys::stat::lstat(&PathBuf::from(p)) {
Ok(stat) => Some(stat.st_dev),
Err(_) => None,
})
.collect();
let tree = construct_fs_tree(
None,
&PathBuf::from("."),
&dev_whitelist,
&ignored_dirs,
)?;
let result = serde_json::to_string_pretty(&tree)?;
let path = Path::new("result.json");
let mut file = File::create(&path).unwrap();
file.write_all(result.as_bytes()).unwrap();
Ok(())
}
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment