I just processed over 1.4 million TIFF images into 35 thousand PDF documents in 24 hours. I'm geek excited and wanted to talk about it. A client is planning to replace halFILE with SPImage, and they need a way to access the historical scans without the maintenance and cost of keeping halFILE running. So a way to search the database and display images was in order. In other words, they need an internal website.
That took a week and was a piece of cake for the most part. The big hurdle, and the main point of this post, is how the HAL scans were saved. When scanning from a TWAIN scanner, halFILE and plenty of other systems store scans as one TIFF file per page. Browsers don't like TIFF files. There are stopgap solutions for web display, but none is as good as converting to PDF. Another latent IT issue is the management problem presented by 4 million files. Lots of IT things get slower: drive management, backups, and data management all take extra time. Sometimes a lot of extra time, causing unnecessary downtime.
How it's set up
Files are saved with an index number for the scan, using the page number as the file extension. There were 4 million TIFF files in this image structure, making up 107 thousand scans. To convert the TIFF images to PDF I wrote a utility which starts in the Master Images folder. From there it creates a duplicate of the underlying folder structure, combining the one-page-per-file TIFF images for each scan into a normal multi-page PDF. It also grabs files with TXT or HAL extensions, as they are used for data recovery. Any image files that can't be read are saved in an Error folder with a tif extension so they are easy to look at in Windows.
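To make that layout concrete, here is a minimal Python sketch of the same idea. It is not the utility described above. The function names are made up for illustration, the choice of Pillow for reading TIFFs and writing PDFs is an assumption, and so is the example file naming (something like 00042.001, 00042.002). It also assumes the output folder sits outside the source tree.

```python
from collections import defaultdict
from pathlib import Path
import shutil


def group_pages(folder: Path) -> dict[str, list[Path]]:
    """Group page files by scan index; the numeric extension is the page number."""
    scans = defaultdict(list)
    for f in folder.iterdir():
        ext = f.suffix[1:]
        if f.is_file() and ext.isdecimal():
            scans[f.stem].append(f)
    # Sort numerically so page 10 follows page 9, not page 1.
    return {stem: sorted(pages, key=lambda p: int(p.suffix[1:]))
            for stem, pages in scans.items()}


def convert_scan(pages: list[Path], pdf_path: Path, error_dir: Path) -> None:
    """Combine one scan's page files into a single multi-page PDF."""
    from PIL import Image  # Pillow: an assumed library, imported only when converting

    images = []
    for page in pages:
        try:
            with Image.open(page) as im:
                # RGB is the simple, safe choice here; it can inflate bilevel scans.
                images.append(im.convert("RGB"))
        except OSError:
            # Unreadable page: keep a copy with a .tif extension for inspection.
            error_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(page, error_dir / (page.name + ".tif"))
    if images:
        pdf_path.parent.mkdir(parents=True, exist_ok=True)
        images[0].save(pdf_path, "PDF", save_all=True, append_images=images[1:])


def convert_tree(src_root: Path, dst_root: Path) -> None:
    """Mirror src_root under dst_root, one PDF per scan, plus TXT/HAL files."""
    folders = [src_root, *(p for p in src_root.rglob("*") if p.is_dir())]
    for folder in folders:
        out = dst_root / folder.relative_to(src_root)
        for stem, pages in group_pages(folder).items():
            convert_scan(pages, out / f"{stem}.pdf", dst_root / "Error")
        for f in folder.iterdir():
            if f.is_file() and f.suffix.lower() in {".txt", ".hal"}:
                out.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, out / f.name)
```

Sorting on the integer value of the extension matters once a scan passes nine pages, and keeping unreadable pages in one Error folder means a bad file never stops the rest of the run.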
halFILE itself doesn't much care what format the scan is in, so the converted images can be slid into the current image structure. The HAL search process will find and display them in the PDF viewer. This project won't keep HAL running, so the website quickly searches the HAL database and displays the PDF in a browser. Chrome is currently required. Since it's an internal site, there isn't a login, and it allows a selection of documents to be emailed to the user. A nice side benefit for a site still running halFILE is that, since the search is in a browser, it can be done from any workstation. Be sure to check HAL licensing before doing that.
Four Million Images
The first bit of code I put together processed just under 2 pages per second. Four million is a fairly large number; four million seconds is almost 7 weeks. I quickly got that down to a few weeks and started a conversion process slated to take 16 days. It wasn't stressing the computer in any fashion. With a little multithreading magic I was able to push my hardware to the limit. The current code processes 1.47 million pages per day. That's over 16 pages per second, at 40 meg a minute. A DVD-sized folder takes an hour and twenty minutes.
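For anyone who wants to check the arithmetic or try the same trick, here is a small Python sketch. It is not the code that ran the conversion. The names are made up for illustration, and `convert` stands in for whatever function turns one scan into a PDF. Whether threads actually speed things up in Python depends on the imaging library releasing the interpreter lock while it works; `ProcessPoolExecutor` offers the same interface if it doesn't.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

SECONDS_PER_DAY = 86_400


def pages_per_second(pages_per_day: float) -> float:
    """Convert a daily page count into a per-second rate."""
    return pages_per_day / SECONDS_PER_DAY


def days_to_finish(total_pages: float, pages_per_sec: float) -> float:
    """How many days a job of total_pages takes at a steady rate."""
    return total_pages / pages_per_sec / SECONDS_PER_DAY


def run_parallel(jobs, convert, workers=None):
    """Run convert(job) for every job on a thread pool.

    Returns (done, failed) so one bad scan never stops the whole run.
    """
    workers = workers or os.cpu_count() or 1
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert, job): job for job in jobs}
        for fut in as_completed(futures):
            job = futures[fut]
            try:
                fut.result()
                done.append(job)
            except Exception:
                failed.append(job)
    return done, failed
```

Plugging in the numbers from this post, `pages_per_second(1_470_000)` comes out at about 17, and at that rate `days_to_finish(4_000_000, ...)` is a little under 3 days for the full 4 million pages.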
If you have halFILE or a similar document management system and want to convert your scans to the more popular PDF format for browser display or other reasons, give Dan a call at 937-424-5734. We can help with the conversion process. There are a multitude of TIFF formats, and even if you have a different file type, let us know; we can likely convert your data.
Breaking a million pages a day for some reason made me smile. It wasn't that long ago we didn't have a million files on the network. To me, watching a computer combine images at a rate over a million pages a day is crazy, and my computer is 2 years old. It is stout, running an i7-3770K with 32 GB RAM, an SSD boot drive, and a Seagate Barracuda RAID 1 array on an ASUS Maximus V Extreme. The process is bound by the processor at this point, so throwing more hardware at it will result in faster processing times. I'm good with a million a day. Give us a call with your data conversion needs.