Dictionary Challenge

Source: https://gist.github.com/seanthehead/11180933

words_test.md

Write a program which, given a dictionary, generates two output files, 'sequences' and 'words'. 'sequences' should contain every sequence of four letters (case insensitive) that appears in exactly one word of the dictionary, one sequence per line. 'words' should contain the corresponding words that contain the sequences, in the same order, again one per line.

For example, given the trivial dictionary containing only

  • arrows
  • carrots
  • give
  • me

The outputs should be:

  • 'sequences'
  • carr
  • give
  • rots
  • rows
  • rrot
  • rrow
  • 'words'
  • carrots
  • give
  • carrots
  • arrows
  • carrots
  • arrows

Of course, 'arro' does not appear in the output, since it is found in more than one word.

For the final solution, read in the following dictionary file: http://bit.ly/1jveLkY

Dictionary Solution

Notes:

  • Added the ability to search for varying sequence lengths and appearance threshold
  • In accordance with the Challenge, the input values default to 4 and 1 respectively but can be changed by the user.
  • If no dictionary file is provided by the user the default dictionary will be used (provided by challenge requirements)

My Thoughts

Initially I started to use the Yii2 framework to create models to utilize the database to do the computations but after running some tests with larger sample sizes it started using too much RAM because it was loading entire model objects into memory. I could have just switched to writing a PHP CLI script to complete the task but I decided to start cataloging my projects and challenges in one place, so I just let the DictionaryForm->process() (link) method handle the calculations.

I added some functionality to the challenge by allowing the user to choose the sequence length and the possible number of occurrences. The original request asked that I showcase some of my various skills and proactively thinking about the overall functionality of an interface is one of those skills.

TODO:

  • Separate out various parts of the process() method into other methods to follow "Open/Close Principals"
  • Use model constants for sequence length and occurrences options when the model is initialized.
  • Reduce computations? I'm there is a better way to process the data to reduce the bigO, I'm just not sure how to accomplish that right now.
  • I need to do a complete test to confirm my results are accurate.