Indexing and Searching Features of Special Interest to Forensics Users
Note: dtSearch does not alter original files or other data, including Hash values, in indexing, searching and display of documents.
New in Beta: MD5 Hash Support
MD5 hashes are 128-bit numerical fingerprints that are sometimes used in forensics to identify files. In the V. 7.86 beta, dtSearch has an option to generate an MD5 hash for each document as it is indexed and append it to the document text as a searchable metadata field. Note: this option will make indexing slower.
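For readers who want to verify a hash independently of the index, the following is a minimal sketch of computing a file's MD5 digest with Python's standard hashlib module. It is not dtSearch code; the function name md5_of_file and the chunk size are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    # Open read-only in binary mode and hash in chunks, so large evidence
    # files are never loaded into memory at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the file is only opened for reading, computing the digest this way does not change the file contents.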
Federated Searching and the dtSearch Spider
dtSearch products provide federated search across any number of directories, emails (with nested attachments), and databases. There is no need to tell dtSearch what files, emails or other content you have; dtSearch will figure that out for itself. Indexing, searching and display of documents do not alter original files or other data, including hash values.
dtSearch’s own document filters support parsing, indexing, searching and display with highlighted hits of text and metadata across a broad range of online and offline data types.
The dtSearch Spider adds local and remote online content to a search. The Spider can index sites to any level of depth, with support for public and private or secure online content. For private or secure sites, the dtSearch Spider supports log-ins and forms-based authentication. For public sites, the dtSearch Spider is a “polite” spider and will not index and search a site with a flag that indicates it does not accept spiders or robots.
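As an illustration of what "polite" crawling means in general, here is a minimal sketch using Python's standard urllib.robotparser. It assumes the "flag" in question is a robots.txt rule, which the text above does not specify; the function name may_fetch and the user agent string are hypothetical, and this is not how the dtSearch Spider is implemented.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt, user_agent, url):
    """Return True if `robots_txt` allows `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines,
    # so no network access is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: every robot is asked to stay out of /private/.
RULES = "User-agent: *\nDisallow: /private/\n"
print(may_fetch(RULES, "ExampleSpider", "https://example.com/private/a.html"))
print(may_fetch(RULES, "ExampleSpider", "https://example.com/public/a.html"))
```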
For convenient offline access, the dtSearch Spider also includes a caching option, to store the full spidered content along with the index. (Without caching, the Spider has to return to the relevant URL to display the full content with highlighted hits.) The caching option can be used with other, non-spidered data as well.
PDFs and OCR
Searchable PDFs. dtSearch recommends using an OCR program like Adobe Acrobat to OCR scanned images into PDF “searchable image” format. If a PDF file is an “image only” PDF (i.e. a PDF where even though you can see text, you cannot cut and paste that text), use an OCR program like Adobe Acrobat to turn the “image only” PDF into a “searchable image” PDF.
Starting in Version 7.84, the View Log function available after an index update can report “image only” PDF files in the index update log. This reporting indicates whether any PDF files require OCR to enable full-text search. (Metadata in “image only” PDF files is indexed in any case.)
The "searchable image" format works by storing the complete original image of a scanned document, along with the text obtained through OCR. The text is "hidden" in the sense that simply opening the PDF file displays only the scanned image, not the underlying OCR’ed text. Because the OCR’ed text is "hidden" in the file, however, dtSearch can index and search it.
Hit-Highlighting in PDFs. The dtSearch PDF Search Highlighter is a free plug-in which runs inside Adobe Reader, making it possible for Adobe Reader to highlight hits in PDFs.
As illustrated above, for "searchable image" PDFs, the dtSearch PDF Search Highlighter will appear to highlight hits directly on the image. (Developers: as an alternative to the client-side dtSearch plug-in, dtSearch third-party developer Contegra Systems has developed a server-side-only PDF hit-highlighting solution, which removes the need for each end user to separately install software to add hit highlights to PDF files. Please email firstname.lastname@example.org for more information.)
PDF Portfolios. A PDF portfolio can contain any combination of PDF or other document types. dtSearch converts the whole PDF portfolio, with all embedded documents, to a single formatted text stream and indexes it as a single file. After a search, dtSearch can highlight hits by converting the PDF portfolio contents to HTML. (An alternative to the "single file" treatment for developers is to use the ExtractionOptions object in the API to break apart the PDF portfolio and extract its components.)
PDF Encryption. dtSearch products support indexing PDF files with 40-bit RC4, 128-bit RC4, 128-bit AES, and 256-bit AES encryption. Exception: if a PDF file has “Restrict editing and printing of the document” selected, and does not have “Enable copying ...” selected, then dtSearch products will not access that PDF file.
To access security and permission settings in a PDF file, open the file in Adobe Acrobat, click File from the top menu, then select Properties and then select the Security tab. If you select Password Security from the drop-down list, one of the options is “Restrict editing and printing of the document.” If that is selected, and “Enable copying ...” is not selected, then dtSearch products will not access that PDF file until those settings are changed. Adobe Acrobat requires entering the original password to change these settings.
To index encrypted PDFs that dtSearch does not support (i.e. PDFs with the “Enable copying ...” permission disabled or PDFs requiring a separate password), make a temporary, decrypted copy of the encrypted files, index the decrypted copies, and then replace the temporary decrypted copies with the encrypted originals. This one-time decryption is sufficient for dtSearch operation. Once the original index is complete, dtSearch does not need to decrypt the PDF files to search and display them with highlighted hits.
Developers: PDF encryption support is implemented in a new component, dtv_pdfCrypto.dll, that is subject to export restrictions. Please see Section 12.4 of the dtSearch setup license agreement for more information.
After an index update completes, click "View Log" to see a report that will include information on any encrypted or unreadable files that the indexer could not process. This report can be accessed at any time in the index folder in the file Index_LastUpdateErrors.html. The report indicates which files were (a) encrypted, (b) corrupt, (c) partially encrypted, and (d) partially corrupt. Partially encrypted or corrupt files are files that could be indexed in part but that included some encrypted or corrupt data, such as an email with an encrypted attachment. (See above topic for additional information on encrypted PDFs.)
dtSearch can automatically recognize dates, email addresses, and credit card numbers, and search for these items by type. Through this feature, dtSearch can, for example, search for a credit card number regardless of how it may be formatted, or search for a range of dates even if the dates are expressed in different text formats (January 15, 2005, through 2/19/07). dtSearch can also extract all dates, emails and credit card numbers from a collection of documents.
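To make the idea of format-independent credit card recognition concrete, the following is a minimal sketch in Python. It is an illustration only, not dtSearch's recognition logic: the candidate pattern, the Luhn checksum filter, and the names luhn_valid and find_card_numbers are all assumptions.

```python
import re

# Candidate: 13 to 19 digits, optionally separated by single spaces or hyphens.
_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text):
    """Return normalized (digits-only) card-like numbers found in `text`."""
    found = []
    for match in _CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found
```

Normalizing every candidate to a digits-only string is what lets "4111 1111 1111 1111" and "4111-1111-1111-1111" compare as the same number; the checksum step simply discards digit runs that cannot be card numbers.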
dtSearch offers a Unicode filtering feature for automatic recovery of text from corrupt forensically-recovered documents and large data blocks, such as those recovered through an "undelete" process, from unallocated computer space, or from partially recovered file fragments. The filtering algorithm can scan recovered data blocks using multiple Unicode and other text encoding detection methods.
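The general technique of scanning a raw data block under more than one encoding can be sketched as follows. This is a simplified illustration, not the dtSearch filtering algorithm: it only looks for runs of printable ASCII and of ASCII-range UTF-16LE characters, and the name recover_text_runs and the default minimum run length are assumptions.

```python
import re


def recover_text_runs(block, min_len=4):
    """Return printable text runs found in a raw bytes `block`.

    Two detection passes are made: single-byte printable ASCII, then
    UTF-16LE text whose characters fall in the printable ASCII range.
    """
    count = str(min_len).encode("ascii")
    ascii_runs = re.compile(rb"[\x20-\x7e]{" + count + rb",}")
    utf16_runs = re.compile(rb"(?:[\x20-\x7e]\x00){" + count + rb",}")

    found = [m.group().decode("ascii") for m in ascii_runs.finditer(block)]
    found += [m.group().decode("utf-16-le") for m in utf16_runs.finditer(block)]
    return found
```

A real recovery filter would cover many more encodings and scripts; the point of the sketch is only that the same bytes are examined several times, once per candidate encoding.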
dtSearch products offer several options for indexing and searching emails and other messages, contacts, appointments and the like. All methods support indexing, searching and display with highlighted hits of item content, metadata and attachments, including multilevel nested email attachments.
(i) Direct indexing of message archives. dtSearch can index and search OST (Exchange) files (starting in V. 7.77), MSG/PST (Outlook/Exchange) files, DBX (Outlook Express) files, and EML/MBOX (Thunderbird, Eudora, etc.) files just like any other supported file types. If you are not searching your own personal email collection (and sometimes even if you are searching your own emails and have a large collection), it is much more efficient to bypass the Outlook/MAPI “middleman” and access the data directly.
(ii) MAPI indexing of message archives. For "live" content in an Outlook profile, dtSearch also offers the option of going through MAPI/Outlook. This option further supports launching a message, contact, appointment, task or note in its native application. For example, you can search for a message in dtSearch, launch the message in Outlook, and then reply to the message using Outlook. Because this approach requires that the content be “live” in Outlook, this option tends to be more suitable for personal email indexing than for larger volume email indexing.
(iii) MAPI extraction of message archives. dtSearch also supports extracting Outlook and Exchange data to MSG files. The MSG conversion approach in dtSearch works through a command-line tool to extract items in bulk from larger volumes of Outlook or Exchange data. The converted MSG files will include all properties of the original files, including any nested attachments. Following conversion, dtSearch can index the resulting MSG files, including highlighting hits in messages and attachments.
(iv) Single document vs. message/attachment separation option. By default, dtSearch indexes each message with all attachments (recursively unpacked and appended to the message body) as a single document. Alternatively, the File Types table can separate the message body and each attachment into its own separate document container.
(v) PST files are the email archives that Microsoft Outlook uses to store messages. Because PSTs have a complex, database-like internal structure, opening a PST file to extract a message (either to view or copy) takes much longer than extracting one item from a ZIP or MBOX archive. This can make reviewing retrieved items after a search slow and cumbersome. The following options may speed up the review process.
First, use dtSearch MAPI extraction to extract the PST items into a folder tree of MSG files, and then index the MSG files. (Before indexing, you can ZIP the folder tree of MSG files into a single ZIP, if you prefer to keep the messages in one file.) After a search, dtSearch can quickly open each MSG item as you click on it, even if it is in a ZIP file.
Second, index the PST file directly with dtSearch, but enable caching of documents in the index. With this feature enabled, dtSearch will ZIP-compress the MSG items and store them in the index, so they can be quickly accessed after a search. Both options will let you quickly open message items after a search. A disadvantage of the second option is that this approach will enable caching for the entire index, not just the email archive.
dtSearch also supports Thunderbird (MBOX/EML), including nested email attachments.
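To show what "nested email attachments" means structurally, here is a minimal sketch that walks a message tree with Python's standard email and mailbox modules and reports each attachment with its nesting depth. It is independent of dtSearch; the helper names walk_with_depth, attachment_names and attachments_in_mbox are hypothetical.

```python
import mailbox


def walk_with_depth(part, depth=0):
    """Yield (depth, part) for a message and all of its nested parts."""
    yield depth, part
    # A multipart container, or an attached message (message/rfc822),
    # holds its children in a list payload.
    if part.is_multipart():
        for child in part.get_payload():
            yield from walk_with_depth(child, depth + 1)


def attachment_names(message):
    """Return (depth, filename) for every named attachment in `message`."""
    return [
        (depth, part.get_filename())
        for depth, part in walk_with_depth(message)
        if part.get_filename()
    ]


def attachments_in_mbox(path):
    """Map each message key in an MBOX file to its attachment list."""
    box = mailbox.mbox(path, create=False)
    try:
        return {key: attachment_names(msg) for key, msg in box.items()}
    finally:
        box.close()
```

An attachment inside a forwarded message that is itself attached to another message shows up at a greater depth than a top-level attachment, which is the case the recursive unpacking described above has to handle.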
Fuzzy searching uses a proprietary algorithm to find search terms even if they are misspelled. dtSearch recommends fuzzy searching for searching emails, OCR’ed text, or any other text that may contain misspellings.
Search fuzziness adjusts from 0 to 10 so you can fine-tune fuzziness to the level of OCR or typographical errors in your files. A search for alphabet with a fuzziness of 1 would find alphaqet; with a fuzziness of 3, it would find both alphaqet and alpkaqet. Fuzziness is not built into the index, so you can vary fuzziness at the time of each search.
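The dtSearch fuzzy algorithm is proprietary, so the following sketch uses a different, well-known technique, Levenshtein edit distance, purely to illustrate how a tolerance for misspellings can be tuned at search time. The function names and the max_edits parameter are assumptions, and max_edits does not correspond to the dtSearch fuzziness scale.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(term, words, max_edits):
    """Return the words within `max_edits` edits of `term`."""
    return [word for word in words if edit_distance(term, word) <= max_edits]
```

With this stand-in measure, alphaqet is one substitution away from alphabet and alpkaqet is two, so raising the tolerance from 1 to 2 admits the second misspelling, much as a higher fuzziness setting does in the example above.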
dtSearch includes Unicode-compatible file parsing, to convert input data to Unicode. dtSearch automatically recognizes all Unicode-supported encodings, representing hundreds of international languages.
The following dtSearch search options work automatically on text in any international language: phrase; Boolean; proximity and directed proximity; wildcard; macro; numeric range; fielded data / metadata search options; fuzzy searching (adjustable from 0 to 10 to account for typographical or OCR errors); and relevancy-ranked searching (including natural language vector-space ranking, positional scoring options, general variable term weighting, variable term weighting in fields, and other API-based document classification and sorting options).
Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words. Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word. To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so each character will be treated as a single word.
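The effect of inserting word breaks around CJK characters can be sketched in a few lines of Python. This is an illustration of the idea, not the dtSearch implementation; the Unicode ranges chosen (kana, common CJK ideographs, Hangul syllables) and the name insert_cjk_breaks are assumptions.

```python
import re

# Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs, Hangul syllables.
_CJK_CHAR = re.compile(r"([\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af])")


def insert_cjk_breaks(text):
    """Surround each CJK character with spaces, leaving other text intact."""
    spaced = _CJK_CHAR.sub(r" \1 ", text)
    # Collapse the doubled spaces created between adjacent CJK characters.
    return " ".join(spaced.split())


print(insert_cjk_breaks("東京都 dtSearch"))
```

After this step an ordinary whitespace tokenizer sees every CJK character as its own word, while Latin-script words such as dtSearch stay whole.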
Note: this setting will only affect text identified as Unicode Chinese, Japanese or Korean text; it will not affect text identified as other Unicode character sets.
For documents in certain formats that do not include encoding information, such as single-byte text files, dtSearch provides a proprietary language recognition algorithm for detecting text in a large variety of languages (Western European, other European, Middle-Eastern, etc.). This algorithm is enabled by default.
A search in dtSearch will always include white-on-white text and similar “invisible” text in files. dtSearch also includes options for searching embedded objects in Microsoft Office documents, and normally hidden content in HTML.
While HTML comments, scripts, links, and styles are not by default included in indexing, dtSearch has an option to include these.
A similar option searches hidden content (such as macros or other embedded objects) in Microsoft Office files.
To index and search the contents of a disk image, mount the disk image so it is visible as part of the Windows file system. dtSearch can then index the mounted image like other Windows folders. (For programmers, the DataSource API supports direct indexing of disk images.)
dtSearch provides an option to search for a list of words. Under this option, a special dialog box provides a way to search for a long list of words, and create a list of matching files, in a single step. This option can work with the full range of dtSearch search features (Boolean, fuzzy, natural language, etc.).
To expand a search for a specific word or set of words to a user-defined list of concepts or synonyms, dtSearch also offers a user-defined thesaurus as an add-on to the comprehensive English-language thesaurus included with dtSearch.
A search report is a good option for quickly going through a large number of retrieved files. A search report lists each hit found in each of the documents retrieved in a search, with a specified number of words or paragraphs of context surrounding it.
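The idea of a hit listed with a fixed number of words of context can be illustrated with a short keyword-in-context sketch. It is not the dtSearch report generator; the function name hits_in_context, its parameters, and the simple whitespace tokenization are assumptions.

```python
import re


def hits_in_context(text, term, words_of_context=3):
    """Return each hit on `term` with surrounding words of context."""
    words = text.split()
    target = term.lower()
    report = []
    for index, word in enumerate(words):
        # Compare case-insensitively, ignoring attached punctuation.
        if re.sub(r"\W+", "", word).lower() == target:
            start = max(0, index - words_of_context)
            end = index + words_of_context + 1
            report.append(" ".join(words[start:end]))
    return report


sample = "The quick brown fox jumps over the lazy dog."
for line in hits_in_context(sample, "fox", words_of_context=2):
    print(line)
```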
dtSearch’s Edit › Copy File function lets you copy all or selected documents retrieved from a search to a folder. You can optionally preserve the full path and filename in the copy, and you can preserve creation and last access times as well as the last modified date. Copy File also gives you the option of copying a single item in a container file such as a ZIP or PST archive, rather than copying the whole container.
The dtSearch Publish product can quickly publish forensically retrieved (or e-discovery retrieved) documents to portable media. The resulting product provides instant search and display access to the document set. The portable media can run with zero footprint, requiring no installation on the end-user's computer.
Please see the Mirroring Searchable Web Content on Portable Media article for an overview of how dtSearch Publish works.