The speech-to-text part is very complicated (hidden Markov model type algorithms). Windows XP onwards supports this, so I assume it's an academic exercise.
You need an electret mike, an amp with at least 50 dB of voltage gain (preferably with soft limiting/compression), an anti-aliasing filter, and an ADC sampling at 11025 Hz or faster, probably at 10+ bits, for speech. This is easily done with modern codecs, but I suspect you need a broken-down system with independent components. You can combine the amplifier/limiter and anti-aliasing filter by using a low-grade audio op-amp with a bit of high-frequency roll-off in the feedback circuit.
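To put rough numbers on those figures, here is a minimal Python sketch of the arithmetic. The 100 k feedback resistor and the 4 kHz corner frequency are assumptions picked for illustration, not values from any particular design, and the SNR formula is the ideal-quantizer approximation for a full-scale sine. Bear in mind that a single capacitor across the feedback resistor only gives a first-order roll-off (20 dB/decade), so treat it as a starting point rather than a complete anti-aliasing filter.

```python
import math

def db_to_voltage_ratio(gain_db):
    """Convert a voltage gain in dB to a plain ratio (20*log10 convention)."""
    return 10 ** (gain_db / 20)

def nyquist_hz(sample_rate_hz):
    """Highest frequency representable without aliasing."""
    return sample_rate_hz / 2

def ideal_adc_snr_db(bits):
    """Ideal quantization SNR for a full-scale sine wave."""
    return 6.02 * bits + 1.76

def feedback_cap_farads(r_feedback_ohms, corner_hz):
    """Capacitor across the feedback resistor for a first-order corner at corner_hz."""
    return 1 / (2 * math.pi * r_feedback_ohms * corner_hz)

gain = db_to_voltage_ratio(50)          # 50 dB of voltage gain as a ratio
nyq = nyquist_hz(11025)                 # Nyquist limit at 11025 Hz sampling
snr = ideal_adc_snr_db(10)              # best-case SNR of a 10-bit ADC
# Assumed values: 100 k feedback resistor, corner at 4 kHz (below Nyquist).
c_f = feedback_cap_farads(100e3, 4000)

print(f"Voltage gain: {gain:.0f}x")
print(f"Nyquist limit: {nyq} Hz")
print(f"Ideal 10-bit SNR: {snr:.1f} dB")
print(f"Feedback capacitor: {c_f * 1e12:.0f} pF")
```

So 50 dB is a gain of roughly 316, the filter has to be well down by about 5.5 kHz, 10 bits gives at best about 62 dB of dynamic range, and with the assumed resistor the feedback capacitor comes out near 400 pF.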